337 research outputs found

    Cloning and sequence analysis of cDNAs encoding the cytosolic precursors of subunits GapA and GapB of chloroplast glyceraldehyde-3-phosphate dehydrogenase from pea and spinach

    Chloroplast glyceraldehyde-3-phosphate dehydrogenase (GAPDH) is composed of two different subunits, GapA and GapB. cDNA clones containing the entire coding sequences of the cytosolic precursors for GapA from pea and for GapB from pea and spinach have been identified and sequenced, and the derived amino acid sequences have been compared to the corresponding sequences from tobacco, maize and mustard. These comparisons show that GapB differs from GapA in about 20% of its amino acid residues and by the presence of a flexible and negatively charged C-terminal extension, possibly responsible for the observed association of the enzyme with chloroplast envelopes in vitro. This C-terminal extension (29 or 30 residues) may be susceptible to proteolytic cleavage, thereby leading to a conversion of chloroplast GAPDH isoenzyme I into isoenzyme II. Evolutionary rate comparisons at the amino acid sequence level show that chloroplast GapA and GapB evolve roughly two-fold slower than their cytosolic counterpart GapC. GapA and GapB transit peptides evolve about 10 times faster than the corresponding mature subunits. They are relatively long (68 and 83 residues for pea GapA and spinach GapB, respectively) and share a similar amino acid framework with other chloroplast transit peptides.

    FAAST: Flow-space Assisted Alignment Search Tool

    Background: High-throughput pyrosequencing (454 sequencing) is the major sequencing platform for producing long-read, high-throughput data. While most other sequencing techniques produce read errors that mainly resemble substitutions, pyrosequencing produces errors that mainly resemble gaps. These errors are detected less efficiently by most conventional alignment programs and may produce inaccurate alignments. Results: We suggest a novel algorithm for calculating the optimal local alignment which utilises flowpeak information in order to improve alignment accuracy. Flowpeak information can be retained from a 454 sequencing run through interpretation of the binary SFF file format. This novel algorithm has been implemented in a program named FAAST (Flow-space Assisted Alignment Search Tool). Conclusions: We present and discuss the results of simulations that show that FAAST, through the use of the novel algorithm, can gain several percentage points of accuracy compared to Smith-Waterman-Gotoh alignments, depending on the 454 data quality. Furthermore, through an efficient multi-thread aware implementation, FAAST is able to perform these high-quality alignments at high speed. The tool is available at http://www.ifm.liu.se/bioinfo/
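
To illustrate why pyrosequencing errors look like gaps rather than substitutions, here is a minimal sketch of naive base calling from flow values. It is not FAAST's algorithm: the function name `call_bases`, the flow order `TACG`, and the example intensities are assumptions made for illustration only. Each flow intensity is rounded to a homopolymer length, so a small drift in one intensity inserts or deletes a base.

```python
def call_bases(flow_values, flow_order="TACG"):
    """Round each flow intensity to a homopolymer length and emit that many bases.

    flow_order is an assumed nucleotide cycle; flow_values are hypothetical
    non-negative intensities, one per flow.
    """
    bases = []
    for i, value in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = int(value + 0.5)  # round half up (values are non-negative)
        bases.append(nucleotide * run_length)
    return "".join(bases)


clean = [1.02, 0.05, 2.03, 0.97]  # T, no A, CC, G
noisy = [1.02, 0.05, 2.51, 0.97]  # third flow drifts upward

print(call_bases(clean))  # TCCG
print(call_bases(noisy))  # TCCCG: the drift appears as an inserted C
```

Keeping the raw intensities instead of the rounded calls is what would let an aligner judge how plausible such an extra base is.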

    Selective Constraints on Amino Acids Estimated by a Mechanistic Codon Substitution Model with Multiple Nucleotide Changes

    Empirical substitution matrices represent the average tendencies of substitutions over various protein families by sacrificing gene-level resolution. We develop a codon-based model in which the mutational tendencies of codons, the genetic code, and the strength of selective constraints against amino acid replacements can be tailored to a given gene. First, selective constraints averaged over proteins are estimated by maximizing the likelihood of each 1-PAM matrix of empirical amino acid (JTT, WAG, and LG) and codon (KHG) substitution matrices. Then, selective constraints specific to given proteins are approximated as a linear function of those estimated from the empirical substitution matrices. Akaike information criterion (AIC) values indicate that a model allowing multiple nucleotide changes fits the empirical substitution matrices significantly better. Also, the ML estimates of transition-transversion bias obtained from these empirical matrices are not as large as previously estimated. The selective constraints are characteristic of proteins rather than species. However, their relative strengths among amino acid pairs can be approximated as depending mainly on the amino acid pairs rather than on protein families, because the present model, in which selective constraints are approximated to be a linear function of those estimated from the JTT/WAG/LG/KHG matrices, can provide a good fit to other empirical substitution matrices including cpREV for chloroplast proteins and mtREV for vertebrate mitochondrial proteins.
    The present codon-based model, with the ML estimates of selective constraints and with adjustable nucleotide mutation rates, would be useful as a simple substitution model in ML and Bayesian inferences of molecular phylogenetic trees, and enables us to obtain biologically meaningful information at both nucleotide and amino acid levels from codon and protein sequences. Comment: Table 9 in this article includes corrections for errata in the Table 9 published in 10.1371/journal.pone.0017244. Supporting information is attached at the end of the article, and a computer-readable dataset of the ML estimates of selective constraints is available from 10.1371/journal.pone.001724

    Non-Negative Matrix Factorization for Learning Alignment-Specific Models of Protein Evolution

    Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data are available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
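
The "weighted sum of the basis matrices" step can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `weighted_sum`, the 2x2 matrices, and the weights are hypothetical, and a real amino acid model would use 20x20 matrices with weights fitted to the alignment of interest.

```python
def weighted_sum(basis_matrices, weights):
    """Combine equally sized basis matrices using non-negative weights."""
    if len(basis_matrices) != len(weights):
        raise ValueError("need exactly one weight per basis matrix")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    rows = len(basis_matrices[0])
    cols = len(basis_matrices[0][0])
    return [
        [sum(w * b[r][c] for b, w in zip(basis_matrices, weights))
         for c in range(cols)]
        for r in range(rows)
    ]


# Two hypothetical basis matrices (learned once from a large general dataset).
basis = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 2.0], [4.0, 0.0]],
]
# Hypothetical alignment-specific weights (the only parameters fitted per alignment).
weights = [0.5, 0.25]

model = weighted_sum(basis, weights)
print(model)  # [[0.0, 1.0], [1.5, 0.0]]
```

Only the weights change from one alignment to the next, which is why the parameter count stays small.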

    Subfamily specific conservation profiles for proteins based on n-gram patterns

    Background: A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}), which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem where the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank-ordered by similarity with respect to the query. Results: The new algorithm was used to construct 4,248 profiles from 120 randomly selected Pfam-A families. These were compared to profiles generated from multiple alignments using the consensus approach. The two profiles were similar whenever the subfamily associated with the query sequence was well represented in the multiple alignment. It was possible to construct subfamily-specific conservation profiles using the new algorithm for subfamilies with as few as five members. The speed of the new algorithm was comparable to the multiple alignment approach. Conclusion: Subfamily-specific conservation profiles can be generated by the new algorithm without a priori knowledge of family relationships or domain architecture. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.
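
As a rough illustration of the NP{n,m} definition, the sketch below enumerates patterns of n residues and m wildcards over every window of size n+m. The function name `ngram_patterns`, the `.` wildcard character, and the choice to allow wildcards at any window position are assumptions; the abstract does not specify these details.

```python
from itertools import combinations


def ngram_patterns(sequence, n, m, wildcard="."):
    """Yield every pattern with n residues and m wildcards per window of size n+m.

    Assumption: wildcards may occupy any positions in the window, including the ends.
    """
    size = n + m
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        for masked in combinations(range(size), m):
            yield "".join(
                wildcard if i in masked else residue
                for i, residue in enumerate(window)
            )


patterns = list(ngram_patterns("ACDE", n=2, m=1))
print(patterns)  # ['.CD', 'A.D', 'AC.', '.DE', 'C.E', 'CD.']
```

Counting such patterns in targets similar to the query, and again over all targets, would give the signal and noise counts the abstract describes.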

    Scaling properties of protein family phylogenies

    One of the classical questions in evolutionary biology is how evolutionary processes are coupled at the gene and species level. With this motivation, we compare the topological properties (mainly the depth scaling, as a characterization of balance) of a large set of protein phylogenies with a set of species phylogenies. The comparative analysis shows that both sets of phylogenies share remarkably similar scaling behavior, suggesting the universality of branching rules and of the evolutionary processes that drive biological diversification from gene to species level. In order to explain such generality, we propose a simple model which allows us to estimate the proportion of evolvability/robustness needed to approximate the scaling behavior observed in the phylogenies, highlighting the relevance of the robustness of a biological system (species or protein) in the scaling properties of the phylogenetic trees. Thus, the rules that govern the incapability of a biological system to diversify are equally relevant both at the gene and at the species level. Comment: Replaced with final published version

    Amino acid "little Big Bang": Representing amino acid substitution matrices as dot products of Euclidian vectors

    Background: Sequence comparisons make use of a one-letter representation for amino acids, the necessary quantitative information being supplied by the substitution matrices. This paper deals with the problem of finding a representation that provides a comprehensive description of amino acid intrinsic properties consistent with the substitution matrices. Results: We present a Euclidian vector representation of the amino acids, obtained by the singular value decomposition of the substitution matrices. The substitution matrix entries correspond to the dot product of amino acid vectors. We apply this vector encoding to the study of the relative importance of various amino acid physicochemical properties upon the substitution matrices. We also characterize and compare the PAM and BLOSUM series substitution matrices. Conclusions: This vector encoding introduces a Euclidian metric in the amino acid space, consistent with substitution matrices. Such a numerical description of the amino acids is useful when their intrinsic properties are needed, for instance when building sequence profiles or finding consensus sequences with machine learning algorithms such as support vector machines and neural networks.
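
The link between a decomposition of the substitution matrix and dot products of vectors can be written out as a short derivation. This is a sketch under assumptions, not the paper's exact construction: it assumes a symmetric matrix $M$ and that the $d$ retained eigenvalues are non-negative; the symbols $x_i$, $u_k$, $\lambda_k$ and $d$ are introduced here for illustration, and the abstract does not say how negative eigenvalues are handled.

```latex
% For a symmetric substitution matrix M with orthonormal eigenvectors u_k
% and eigenvalues \lambda_k (the singular values are |\lambda_k|):
M = \sum_{k} \lambda_k \, u_k u_k^{\mathsf{T}}

% Assign to amino acid i the d-dimensional vector
x_i = \left( \sqrt{\lambda_1}\, u_{i1},\; \dots,\; \sqrt{\lambda_d}\, u_{id} \right),
\qquad \lambda_1, \dots, \lambda_d \ge 0

% The dot product of two such vectors then approximates the matrix entry
x_i \cdot x_j = \sum_{k=1}^{d} \lambda_k \, u_{ik} u_{jk} \approx M_{ij}
```

The approximation becomes an equality when every nonzero eigenvalue is non-negative and all of them are retained.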